[LEADS-348] evaluation latency#235
Conversation
WalkthroughThis PR refactors evaluation result timing from a single ChangesTiming Model Refactoring
🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
3eed19a to
14e437e
Compare
8eb287c to
e9c4b17
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lightspeed_evaluation/core/output/generator.py`:
- Line 817: The JSON summary currently adds "evaluation_latency":
r.evaluation_latency but removed the older "execution_time" field; restore
compatibility by including "execution_time" in the emitted dict (e.g., set
"execution_time": r.execution_time or, if r has no execution_time, set it to the
same value as r.evaluation_latency) alongside "evaluation_latency" in the output
constructed where the dict containing "evaluation_latency" is created so older
consumers still receive execution_time.
In `@src/lightspeed_evaluation/core/storage/sql_storage.py`:
- Line 60: The SQL schema and persistence code currently define/persist only
evaluation_latency, dropping the previously stored execution_time; add back
execution_time = Column(Float, nullable=True) alongside evaluation_latency in
the model definition and update the persistence logic that currently writes
evaluation_latency (e.g., in the function that inserts/updates evaluation
records—look for code referencing evaluation_latency around the save/insert
method) to also set execution_time from the same source value so existing
consumers continue receiving execution_time.
In `@tests/unit/pipeline/evaluation/test_errors.py`:
- Line 160: The test currently asserts results[0].evaluation_latency but misses
verifying the backward-compatible execution_time field on error results; update
the test in tests/unit/pipeline/evaluation/test_errors.py to also assert that
results[0].execution_time equals 0.0 alongside the existing evaluation_latency
assertion so error results explicitly include the legacy execution_time field.
In `@tests/unit/pipeline/evaluation/test_evaluator.py`:
- Around line 583-605: The test
test_execution_time_conversation_level_averages_agent_latency incorrectly
asserts result.agent_latency == 4.0; update the assertion to expect the average
agent latency of the conversation (2.0) since TurnData instances have
agent_latency 1.0 and 3.0 and EvaluationRequest.for_conversation should compute
the mean; change the assertion referencing result.agent_latency in this test to
assert result.agent_latency == 2.0 and keep the other
execution_time/evaluation_latency checks unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 56132854-62fe-4236-8dc5-3380d17e64a8
📒 Files selected for processing (15)
config/system.yamlsrc/lightspeed_evaluation/core/constants.pysrc/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/core/output/generator.pysrc/lightspeed_evaluation/core/storage/sql_storage.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.pytests/script/conftest.pytests/unit/core/models/test_data.pytests/unit/core/models/test_summary.pytests/unit/core/output/test_generator.pytests/unit/core/storage/test_sql_storage.pytests/unit/core/system/test_loader.pytests/unit/pipeline/evaluation/conftest.pytests/unit/pipeline/evaluation/test_errors.pytests/unit/pipeline/evaluation/test_evaluator.py
e9c4b17 to
457af2f
Compare
asamal4
left a comment
There was a problem hiding this comment.
Thanks. A minor typo !!
457af2f to
39af396
Compare
39af396 to
6d2fd1d
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/core/storage/test_sql_storage.py (1)
269-314:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd assertion for
execution_timein comprehensive storage test.The test
test_all_fields_storedverifies that allEvaluationResultfields are persisted correctly. It setsevaluation_latency=2.5(line 283) and assertsrow.evaluation_latency == 2.5(line 312), but doesn't verifyexecution_timeis stored correctly.Per lines 417-428, both
evaluation_latencyandexecution_timeare DB columns. This test should assertrow.execution_timeto ensure backward compatibility and verify the relationship between timing fields is preserved through storage.🔍 Suggested assertion to add
assert row is not None assert row.conversation_group_id == "conv_full" assert row.score == 0.92 assert row.evaluation_latency == 2.5 + assert row.execution_time is not None # Verify backward compat field is stored assert row.tool_calls == '[{"name": "search"}]'If
execution_timeshould equalevaluation_latency + agent_latency, verify that relationship:# Assuming agent_latency defaults to 0 when not set assert row.execution_time == 2.5 # or appropriate computed value🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/core/storage/test_sql_storage.py` around lines 269 - 314, The test_all_fields_stored test creates an EvaluationResult with evaluation_latency=2.5 but never asserts the execution_time DB column; update the test to read row.execution_time and assert it equals the expected value (likely 2.5 if agent latency defaults to 0 or evaluation_latency + agent_latency if agent_latency is present) after saving via SQLStorageBackend.initialize/save_result; reference the EvaluationResult instance, the evaluation_latency field, and the execution_time DB column to ensure the stored execution_time is validated.
🧹 Nitpick comments (3)
tests/unit/core/models/test_summary.py (1)
29-29: ⚡ Quick winConsider including
execution_timein test fixture for completeness.The fixture now sets
evaluation_latency: 1.0but doesn't setexecution_time. Per the PR summary,execution_timeis retained for backward compatibility and should equalevaluation_latency + agent_latency. Consider explicitly settingexecution_timein the fixture to create realistic test data and verify the field is properly handled throughout the summary generation logic.♻️ Suggested addition to fixture
_RESULT_DEFAULTS: dict[str, Any] = { "conversation_group_id": "conv1", "tag": "eval", "turn_id": "turn1", "metric_identifier": "ragas:faithfulness", "result": "PASS", "score": 0.85, "threshold": 0.7, "reason": "Good", "evaluation_latency": 1.0, + "agent_latency": 0.5, + "execution_time": 1.5, "judge_llm_input_tokens": 100, "judge_llm_output_tokens": 50, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/core/models/test_summary.py` at line 29, The test fixture in tests/unit/core/models/test_summary.py sets "evaluation_latency" but omits "execution_time"; update the fixture to include "execution_time" and set it equal to evaluation_latency + agent_latency (so execution_time = evaluation_latency + agent_latency) to match the PR behavior and ensure summary generation and backward-compat checks (e.g., in any assertions referencing execution_time) use realistic data.tests/unit/core/storage/test_sql_storage.py (1)
53-53: ⚡ Quick winConsider setting
execution_timein test fixture for backward compatibility verification.The fixture sets
evaluation_latency=1.5but doesn't setexecution_timeoragent_latency. Per the PR summary and line 428 of this file,execution_timeis retained as a DB column for backward compatibility. The test fixture should either:
- Explicitly set
execution_timeto verify it's stored correctly, or- Verify the model auto-computes
execution_time = evaluation_latency + agent_latencyWithout this, tests may not catch issues where
execution_timeis NULL or incorrectly computed.♻️ Suggested fixture enhancement
return EvaluationResult( conversation_group_id="conv_001", tag="test", turn_id="turn_1", metric_identifier="ragas:faithfulness", metric_metadata='{"threshold": 0.8}', result="PASS", score=0.95, threshold=0.8, reason="Good response", query="What is Python?", response="Python is a programming language.", evaluation_latency=1.5, + agent_latency=0.0, + execution_time=1.5, api_input_tokens=100, api_output_tokens=50, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/core/storage/test_sql_storage.py` at line 53, The test fixture in tests/unit/core/storage/test_sql_storage.py currently sets evaluation_latency=1.5 but omits execution_time and agent_latency; update the fixture (the evaluation row/setup used in the tests) to either explicitly set execution_time (e.g., execution_time=<expected value>) to verify DB persistence, or set agent_latency and then add an assertion in the relevant test that execution_time == evaluation_latency + agent_latency so the code path that computes/stores execution_time is exercised; refer to the fields execution_time, evaluation_latency, and agent_latency when making the change.tests/unit/core/models/test_data.py (1)
472-472: ⚡ Quick winAdd an
execution_timedefault assertion to the test
EvaluationResult.execution_timeis defined withdefault=0.0alongsideevaluation_latency=0.0; extendtests/unit/core/models/test_data.py(line 472) to assertresult.execution_time == 0.0in addition toresult.evaluation_latency == 0.0.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/core/models/test_data.py` at line 472, The test currently asserts result.evaluation_latency == 0 but misses the sibling default; add an assertion for EvaluationResult.execution_time by verifying result.execution_time == 0.0 immediately alongside the existing evaluation_latency check (i.e., update the test that references result to include result.execution_time == 0.0).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tests/unit/core/storage/test_sql_storage.py`:
- Around line 269-314: The test_all_fields_stored test creates an
EvaluationResult with evaluation_latency=2.5 but never asserts the
execution_time DB column; update the test to read row.execution_time and assert
it equals the expected value (likely 2.5 if agent latency defaults to 0 or
evaluation_latency + agent_latency if agent_latency is present) after saving via
SQLStorageBackend.initialize/save_result; reference the EvaluationResult
instance, the evaluation_latency field, and the execution_time DB column to
ensure the stored execution_time is validated.
---
Nitpick comments:
In `@tests/unit/core/models/test_data.py`:
- Line 472: The test currently asserts result.evaluation_latency == 0 but misses
the sibling default; add an assertion for EvaluationResult.execution_time by
verifying result.execution_time == 0.0 immediately alongside the existing
evaluation_latency check (i.e., update the test that references result to
include result.execution_time == 0.0).
In `@tests/unit/core/models/test_summary.py`:
- Line 29: The test fixture in tests/unit/core/models/test_summary.py sets
"evaluation_latency" but omits "execution_time"; update the fixture to include
"execution_time" and set it equal to evaluation_latency + agent_latency (so
execution_time = evaluation_latency + agent_latency) to match the PR behavior
and ensure summary generation and backward-compat checks (e.g., in any
assertions referencing execution_time) use realistic data.
In `@tests/unit/core/storage/test_sql_storage.py`:
- Line 53: The test fixture in tests/unit/core/storage/test_sql_storage.py
currently sets evaluation_latency=1.5 but omits execution_time and
agent_latency; update the fixture (the evaluation row/setup used in the tests)
to either explicitly set execution_time (e.g., execution_time=<expected value>)
to verify DB persistence, or set agent_latency and then add an assertion in the
relevant test that execution_time == evaluation_latency + agent_latency so the
code path that computes/stores execution_time is exercised; refer to the fields
execution_time, evaluation_latency, and agent_latency when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: be024615-6acf-4965-a895-19197c932bc2
📒 Files selected for processing (15)
config/system.yamlsrc/lightspeed_evaluation/core/constants.pysrc/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/core/output/generator.pysrc/lightspeed_evaluation/core/storage/sql_storage.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.pytests/script/conftest.pytests/unit/core/models/test_data.pytests/unit/core/models/test_summary.pytests/unit/core/output/test_generator.pytests/unit/core/storage/test_sql_storage.pytests/unit/core/system/test_loader.pytests/unit/pipeline/evaluation/conftest.pytests/unit/pipeline/evaluation/test_errors.pytests/unit/pipeline/evaluation/test_evaluator.py
✅ Files skipped from review due to trivial changes (1)
- src/lightspeed_evaluation/core/constants.py
Description
Depends on #230
Further changes for LEADS-348:
result.execution_timewas renamed toresult.evaluation_latencywhich represents better that it is measuring the evaluationresult.evaluation_latencywas removed for consistency, since other timing metrics are not being roundedresult.execution_timewas reintroduced (for backward compatibility), but currently measures the whole api execution and evaluation per turntest_evaluation.pywas refactored - redundant mocking was moved to pytest fixture (~500 lines reduced)Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Tests